Background

Data

This dataset contains measurements of electricity consumption from a single household, taken at one-minute intervals over nearly four years. It includes various electrical quantities and some sub-metering data.

This archive includes 2,075,259 measurements collected from a house in Sceaux, located 7 km from Paris, France, between December 2006 and November 2010 (covering 47 months).

This data set was sourced from the University of California, Irvine (UCI) Machine Learning Repository. For more information, see the Individual household electric power consumption Data Set (UC Irvine).


At what times of day and during which weeks or months is power consumption at its highest?


Load Data

Load Packages

library('dplyr')
library('lubridate')
library('ggplot2')
library('tidyr')
library('plotly')
library('psych')
library('corrplot')

Set Directory & Read File

setwd("/Users/robertoruizfelix/Downloads/")
raw_data = readLines("household_power_consumption.txt")
str(raw_data)
##  chr [1:2075260] "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3" ...
head(raw_data)
## [1] "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3"
## [2] "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000"                                                        
## [3] "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000"                                                        
## [4] "16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000"                                                        
## [5] "16/12/2006;17:27:00;5.388;0.502;233.740;23.000;0.000;1.000;17.000"                                                        
## [6] "16/12/2006;17:28:00;3.666;0.528;235.680;15.800;0.000;1.000;17.000"

Convert Raw Data to a Data Frame

## 'data.frame':    2075259 obs. of  9 variables:
##  $ Date                 : chr  "16/12/2006" "16/12/2006" "16/12/2006" "16/12/2006" ...
##  $ Time                 : chr  "17:24:00" "17:25:00" "17:26:00" "17:27:00" ...
##  $ Global_active_power  : num  4.22 5.36 5.37 5.39 3.67 ...
##  $ Global_reactive_power: num  0.418 0.436 0.498 0.502 0.528 0.522 0.52 0.52 0.51 0.51 ...
##  $ Voltage              : num  235 234 233 234 236 ...
##  $ Global_intensity     : num  18.4 23 23 23 15.8 15 15.8 15.8 15.8 15.8 ...
##  $ Sub_metering_1       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sub_metering_2       : num  1 1 2 1 1 2 1 1 1 2 ...
##  $ Sub_metering_3       : num  17 16 17 17 17 17 17 17 17 16 ...
##         Date     Time Global_active_power Global_reactive_power Voltage
## 2 16/12/2006 17:24:00               4.216                 0.418  234.84
## 3 16/12/2006 17:25:00               5.360                 0.436  233.63
## 4 16/12/2006 17:26:00               5.374                 0.498  233.29
## 5 16/12/2006 17:27:00               5.388                 0.502  233.74
## 6 16/12/2006 17:28:00               3.666                 0.528  235.68
## 7 16/12/2006 17:29:00               3.520                 0.522  235.02
##   Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
## 2             18.4              0              1             17
## 3             23.0              0              1             16
## 4             23.0              0              2             17
## 5             23.0              0              1             17
## 6             15.8              0              1             17
## 7             15.0              0              2             17
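The code for this step does not appear in the report. A minimal sketch of one way to do the conversion with base R follows, using a short stand-in for raw_data. It assumes missing values are marked with "?", which is not visible in the excerpt above, and the third data row is hypothetical. This approach numbers rows from 1, whereas the output above starts at 2, so the original code likely differed.

```r
# Stand-in for the raw_data vector read with readLines() above
raw_sample <- c(
  "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3",
  "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000",
  "16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000",
  "16/12/2006;17:26:00;?;?;?;?;?;?;?"  # hypothetical row with missing readings
)

# Parse the semicolon-separated lines; "?" is assumed to mark missing values
parsed <- read.table(
  text = raw_sample,
  sep = ";",
  header = TRUE,
  na.strings = "?",
  stringsAsFactors = FALSE
)
str(parsed)
```

Declaring the missing-value marker up front keeps the measurement columns numeric, which matches the num columns in the structure shown above.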

Data Card

| Column Position | Attribute | Definition | Example |
|---|---|---|---|
| 1 | Date | Date in dd/mm/yyyy | 14/11/2020 |
| 2 | Time | Time in hh:mm:ss | 20:12:59 |
| 3 | Global_active_power | Household global minute-averaged active power (kW) | 3.14 |
| 4 | Global_reactive_power | Household global minute-averaged reactive power (kW) | 0.420 |
| 5 | Voltage | Minute-averaged voltage (V) | 234.01 |
| 6 | Global_intensity | Household global minute-averaged current intensity (A) | 15.15 |
| 7 | Sub_metering_1 | Energy sub-metering (watt-hour of active energy); corresponds to the kitchen. | 16 |
| 8 | Sub_metering_2 | Energy sub-metering (watt-hour of active energy); laundry room. | 1 |
| 9 | Sub_metering_3 | Energy sub-metering (watt-hour of active energy); electric water-heater and an air-conditioner. | 10 |

Data Cleaning

# Omit missing values
data = na.omit(data)
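Comparing the 2,075,259 rows in the data frame above with the n of 2,049,280 reported in the descriptive statistics implies that this step drops 25,979 rows. A small sketch of how the count could be checked before dropping anything, using a toy data frame (toy_data and its values are hypothetical) in place of data:

```r
# Toy stand-in for the parsed data frame; values are illustrative only
toy_data <- data.frame(
  Global_active_power = c(4.216, NA, 5.374),
  Voltage = c(234.84, NA, 233.29)
)

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(toy_data))

# na.omit() removes exactly those rows
cleaned <- na.omit(toy_data)
c(before = nrow(toy_data), dropped = rows_with_na, after = nrow(cleaned))
```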

Features & Feature Engineering

colnames(data)[7:9] = c("Kitchen(W/hr)", "Laundry_Room(W/hr)", "Electric_WaterHeater/AC(W/hr)")
data = data %>%
  mutate(
    `Total_metering(W/hr)` = `Kitchen(W/hr)` + `Laundry_Room(W/hr)` + `Electric_WaterHeater/AC(W/hr)`,
    Apparent_Power = sqrt(Global_active_power^2 + Global_reactive_power^2),
    Power_Factor = Global_active_power / Apparent_Power,
    Date = dmy(Date),
    DateTime = as.POSIXct(paste(Date, Time)),
    Time = hms(Time),
    Year = year(DateTime),
    Month = month(DateTime),
    Week = week(DateTime),
    Day = yday(DateTime)
    ) %>%
  select(-Date, -Time)
head(data)
##   Global_active_power Global_reactive_power Voltage Global_intensity
## 2               4.216                 0.418  234.84             18.4
## 3               5.360                 0.436  233.63             23.0
## 4               5.374                 0.498  233.29             23.0
## 5               5.388                 0.502  233.74             23.0
## 6               3.666                 0.528  235.68             15.8
## 7               3.520                 0.522  235.02             15.0
##   Kitchen(W/hr) Laundry_Room(W/hr) Electric_WaterHeater/AC(W/hr)
## 2             0                  1                            17
## 3             0                  1                            16
## 4             0                  2                            17
## 5             0                  1                            17
## 6             0                  1                            17
## 7             0                  2                            17
##   Total_metering(W/hr) Apparent_Power Power_Factor            DateTime Year
## 2                   18       4.236671    0.9951210 2006-12-16 17:24:00 2006
## 3                   17       5.377704    0.9967080 2006-12-16 17:25:00 2006
## 4                   19       5.397025    0.9957337 2006-12-16 17:26:00 2006
## 5                   18       5.411335    0.9956877 2006-12-16 17:27:00 2006
## 6                   18       3.703828    0.9897868 2006-12-16 17:28:00 2006
## 7                   19       3.558495    0.9891823 2006-12-16 17:29:00 2006
##   Month Week Day
## 2    12   50 350
## 3    12   50 350
## 4    12   50 350
## 5    12   50 350
## 6    12   50 350
## 7    12   50 350

New Columns

  • Total_metering(W/hr): Sum of the three sub-metering readings. The column names say W/hr, but the unit is watt-hours of active energy, as in the data card

  • Apparent Power:

    \[ \text{Apparent Power} = \sqrt{\text{Global Active Power}^2 + \text{Global Reactive Power}^2} \]

  • Power Factor:

    \[ \text{Power Factor} = \frac{\text{Global Active Power}}{\text{Apparent Power}} \]

  • DateTime: Combined Date and Time

  • Time: Parsed with hms() as a time class, then dropped along with Date once DateTime exists

  • Year: Year of observation

  • Month: Month of observation (1-12)

  • Week: Week of the year of observation (1-53)

  • Day: Day of the year of observation (1-366)
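As a worked check of the two formulas, the first observation in the output above (active power 4.216, reactive power 0.418) can be computed by hand. The variable names here are illustrative only.

```r
# First observation from the head(data) output above
active <- 4.216
reactive <- 0.418

# Apparent power: magnitude of the active/reactive pair
apparent_power <- sqrt(active^2 + reactive^2)

# Power factor: share of apparent power that is active power
power_factor <- active / apparent_power

round(apparent_power, 6)  # 4.236671, as in the Apparent_Power column above
round(power_factor, 6)    # 0.995121, as in the Power_Factor column above
```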

Descriptive Statistics

##                               vars       n    mean     sd     min     max
## Global_active_power              1 2049280    1.09   1.06    0.08   11.12
## Global_reactive_power            2 2049280    0.12   0.11    0.00    1.39
## Voltage                          3 2049280  240.84   3.24  223.20  254.15
## Global_intensity                 4 2049280    4.63   4.44    0.20   48.40
## Kitchen(W/hr)                    5 2049280    1.12   6.15    0.00   88.00
## Laundry_Room(W/hr)               6 2049280    1.30   5.82    0.00   80.00
## Electric_WaterHeater/AC(W/hr)    7 2049280    6.46   8.44    0.00   31.00
## Total_metering(W/hr)             8 2049280    8.88  12.86    0.00  134.00
## Apparent_Power                   9 2049280    1.11   1.05    0.08   11.12
## Power_Factor                    10 2049280    0.96   0.06    0.56    1.00
## DateTime                        11 2049280     NaN     NA     Inf    -Inf
## Year                            12 2049280 2008.42   1.12 2006.00 2010.00
## Month                           13 2049280    6.45   3.42    1.00   12.00
## Week                            14 2049280   26.29  14.96    1.00   53.00
## Day                             15 2049280  181.03 104.74    1.00  366.00
##                                range   se
## Global_active_power            11.05 0.00
## Global_reactive_power           1.39 0.00
## Voltage                        30.95 0.00
## Global_intensity               48.20 0.00
## Kitchen(W/hr)                  88.00 0.00
## Laundry_Room(W/hr)             80.00 0.00
## Electric_WaterHeater/AC(W/hr)  31.00 0.01
## Total_metering(W/hr)          134.00 0.01
## Apparent_Power                 11.05 0.00
## Power_Factor                    0.44 0.00
## DateTime                        -Inf   NA
## Year                            4.00 0.00
## Month                          11.00 0.00
## Week                           52.00 0.01
## Day                           365.00 0.07
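The DateTime row shows NaN and Inf, presumably because a date-time column is not numeric and has no meaningful mean or range in this kind of summary. A minimal sketch of keeping only numeric columns before summarising, using base R and a toy data frame (toy_stats is hypothetical) rather than the report's own summary call:

```r
# Toy stand-in for `data`, with one numeric and one date-time column
toy_stats <- data.frame(
  Global_active_power = c(4.216, 5.360, 5.374),
  DateTime = as.POSIXct(
    c("2006-12-16 17:24:00", "2006-12-16 17:25:00", "2006-12-16 17:26:00"),
    tz = "UTC"
  )
)

# is.numeric() is FALSE for POSIXct, so DateTime is filtered out
numeric_only <- toy_stats[sapply(toy_stats, is.numeric)]
names(numeric_only)
```

The filtered data frame can then be passed to whichever summary function is in use.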

Power Factor (PF)

A measure of how effectively electrical power is being converted into useful work output.

  • A PF of 1 indicates that all the power is being used effectively for work, meaning there is no reactive power.

  • A PF smaller than 1 indicates that not all the power is being used effectively.

Every power factor in the data is above 0.55 (the minimum is 0.56) and the mean is 0.96, so the household converts most of the power it draws into useful work. The majority of the readings are above 0.9, which indicates minimal loss in the electrical distribution system. A household with power factors this high is unlikely to face higher energy costs, because the utility has little additional apparent power to charge for.
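The claim about the share of readings above 0.9 can be checked directly. A minimal sketch, using a hypothetical vector pf in place of data$Power_Factor:

```r
# Hypothetical power-factor readings standing in for data$Power_Factor
pf <- c(0.995, 0.997, 0.990, 0.989, 0.850, 0.560)

# Proportion of readings above 0.9 (mean of a logical vector)
share_high <- mean(pf > 0.9)
share_high
```

On the full data, a value above 0.5 would confirm that the majority of readings exceed 0.9.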

Visualizations (top-down)

Total Metering by Year

Group data by year

yearly_data = data %>%
  group_by(Year)
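The plotting code is not shown in this report. One way the yearly bar plot could be produced is to sum the metering per year and pass the result to geom_col(). This is a sketch on a toy data frame (toy_yearly and its values are illustrative), not necessarily the original approach.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for `data`; values are illustrative only
toy_yearly <- data.frame(
  Year = c(2006, 2006, 2007, 2007, 2007),
  `Total_metering(W/hr)` = c(18, 17, 19, 18, 18),
  check.names = FALSE
)

# One total per year
yearly_totals <- toy_yearly %>%
  group_by(Year) %>%
  summarise(Total_Metering = sum(`Total_metering(W/hr)`), .groups = "drop")

# Bar plot of the yearly totals
yearly_plot <- ggplot(yearly_totals, aes(x = factor(Year), y = Total_Metering)) +
  geom_col() +
  labs(x = "Year", y = "Total metering (Wh)")
yearly_plot
```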

The bar plot above shows much less appliance use in 2006. Let's investigate why.

months_by_year <- data %>%
  mutate(Month = format(DateTime, "%B")) %>%  # Extract the month name
  group_by(Year) %>%
  summarise(Months = list(unique(Month)))
months_by_year
## # A tibble: 5 × 2
##    Year Months    
##   <dbl> <list>    
## 1  2006 <chr [1]> 
## 2  2007 <chr [12]>
## 3  2008 <chr [12]>
## 4  2009 <chr [12]>
## 5  2010 <chr [11]>

The tibble shows that 2006 has only one month of data (recording began on 16 December 2006). 2010 has 11 months, which is still sufficient for the time-series analysis we will be conducting. We therefore drop 2006, since it has too little data for our use.

# Remove 2006 data
data = data[data$Year != 2006, ] 

Metering by Month: Total vs. Average Line Graph

# Sum of Monthly data
monthly_data_total = data %>%
  group_by(Year, Month) %>%
  summarise(Total_Metering = sum(`Total_metering(W/hr)`), .groups = "drop")

# Average of Monthly data
monthly_data_avg = data %>%
  group_by(Year, Month) %>%
  summarise(Mean_Metering = mean(`Total_metering(W/hr)`), .groups = "drop")
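The line graphs themselves are not reproduced here. A hedged sketch of how the total graph could be drawn from a data frame shaped like monthly_data_total, with one line per year (toy_monthly and its values are illustrative only):

```r
library(ggplot2)

# Toy monthly totals in the shape of monthly_data_total
toy_monthly <- data.frame(
  Year = rep(c(2007, 2008), each = 3),
  Month = rep(1:3, times = 2),
  Total_Metering = c(500, 450, 400, 520, 470, 380)
)

# One line per year, months on the x-axis
monthly_plot <- ggplot(
  toy_monthly,
  aes(x = Month, y = Total_Metering, colour = factor(Year))
) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 1:12) +
  labs(x = "Month", y = "Total metering (Wh)", colour = "Year")
monthly_plot
```

The same call with Mean_Metering on the y-axis would give the average graph.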

The average and total metering graphs are very similar for all years and do not deviate from each other much. In both, the first, second, and twelfth months of the year (January, February, and December) have the highest energy sub-metering. The exception is 2010, which has no data for the twelfth month.

Metering by combined Months: Total vs. Average Box Plot

As above, the two graphs are very similar to each other even after the yearly data are combined, with the 12th month of 2010 excluded. The boxes show that the first, second, and twelfth months have the highest medians as well as the highest maximums.

Metering by Weeks: Total vs. Average Line Graph

# Sum of Weekly data
weekly_data_ttl = data %>%
  group_by(Year, Week) %>%
  summarise(Total_Metering = sum(`Total_metering(W/hr)`), .groups = "drop")

# Average Weekly data
weekly_data_avg = data %>%
  group_by(Year, Week) %>%
  summarise(Mean_Metering = mean(`Total_metering(W/hr)`), .groups = "drop")

The average and total graphs peak at similar times. The three highest energy sub-metering readings are at weeks 5, 48, and 52. The average graph peaks at week 53 instead of 52, but it is important to note that there is no data for 2010 from week 48 onward. These two weeks are adjacent and fall within the same month, so at a macro level they mark the same peak.

Metering by combined Weeks: Total vs. Average Box Plot

The box plots, which pool the data across years rather than keeping each year distinct, show a pattern very similar to that of the line graphs above. Both box plots show the same peaks at weeks 8 and 48. With the data combined, the second box plot also shows that week 52 has higher energy sub-metering, judging by its maximum and median. Thus, weeks 8, 48, and 52 have the highest energy sub-metering values.

Sub-metering Category Visualizations

What sub-metering category uses the most energy?

Looking at both graphs, it becomes clear that the Electric Water Heater and Air Conditioning Systems use the most energy across all weeks.
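One way to back this up numerically is to total each sub-metering column and compare. A minimal sketch using the first three rows shown earlier in place of the full data (the object names toy_sub and category_totals are illustrative):

```r
# Renamed sub-metering columns from the feature engineering step
sub_cols <- c("Kitchen(W/hr)", "Laundry_Room(W/hr)", "Electric_WaterHeater/AC(W/hr)")

# First three rows of the head(data) output, used as a stand-in for `data`
toy_sub <- data.frame(
  kitchen = c(0, 0, 0),
  laundry = c(1, 1, 2),
  heater_ac = c(17, 16, 17)
)
names(toy_sub) <- sub_cols

# Total per category, then the category with the largest total
category_totals <- colSums(toy_sub)
top_category <- names(which.max(category_totals))
top_category
```

The means in the descriptive statistics (1.12, 1.30, and 6.46) point the same way.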

Hourly Analysis

During what time of day is energy used the most?

Looking at the graph above, it becomes clear that most energy is used from hours 8-9 (8-9 AM) and 20-21 (8-9 PM).
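The hourly aggregation code is not shown in this report. A minimal sketch of one way to derive the hour of day from DateTime and average the metering per hour, using base R and a toy data frame (toy_hourly, its timestamps, and its readings are illustrative only):

```r
# Toy stand-in for `data`; the UTC time zone is an assumption for this example
toy_hourly <- data.frame(
  DateTime = as.POSIXct(
    c("2007-01-01 08:15:00", "2007-01-01 08:45:00",
      "2007-01-01 14:00:00", "2007-01-01 20:30:00"),
    tz = "UTC"
  ),
  Total_metering = c(30, 20, 5, 40)
)

# Hour of day (0-23); lubridate::hour(DateTime) would be equivalent
toy_hourly$Hour <- as.integer(format(toy_hourly$DateTime, "%H"))

# Mean metering for each hour of day
hourly_mean <- tapply(toy_hourly$Total_metering, toy_hourly$Hour, mean)
hourly_mean
```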

Conclusion

Through the various visualizations above, it becomes clear that most energy is used during: